Introduction
This report explores the dataframes from April 2016 (sense_my_feup.Rmd). The data is soon filtered to consider only points in the Porto square. Finally, it concludes the main edges for better evaluation in the next chapters.
Index:
- Data setup
- Intersession time
- Mapping
- Sessions
- Speed
- Time series (first attempt)
SenseMyFEUP data
Data loading
Data used is previously filtered by travelmode (car and bus) and date (April 2016).
We filter our data to consider only points in a square of the Porto Metropolitan area.
Intersession times
This section explores the time between different sessions in the same edge.
### Intersession times and number of sessions.
All intersession time Porto April16. This figure shows that most of the edges have an intersession time greater than 1 hour.
Intersession time along the week. Here we can see that weekends have an intersession time way bigger than weekdays, this could be caused by users going out of Porto on the weekends, reaching out our frontier.
We recommend that you use the dev version of ggplot2 with `ggplotly()`
Install it with: `devtools::install_github('hadley/ggplot2')`
Number of sessions per edge. We can see the limitations in our data when most of the edges have less than 5 sessions in the whole month, mostly even having just 1. In next chapters we will study more the edges with more than 50 sessions.
Analizing edges >50 sessions.
- hotedges -> more 50 sessions. 3537 sessions in total.
- superhot_edges -> more 170 sessions. 1369 sessions in total.
Points (probably to delete)
This section is needed to then map. Maps the amount of points. It is better to map with number of sessions.
ECDF
Only 25% is under 1h intersession time.
As expected, there is higher intersession time at night.
ECDF by #sessions

Edges with >50 sessions.
We recommend that you use the dev version of ggplot2 with `ggplotly()`
Install it with: `devtools::install_github('hadley/ggplot2')`
We recommend that you use the dev version of ggplot2 with `ggplotly()`
Install it with: `devtools::install_github('hadley/ggplot2')`
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Removed 248 rows containing non-finite values (stat_bin).
Showing maps
Traffic Map April Porto 2016 all
Traffic Map April 2016 all >50
Top edges
Number of sessions analysis.
Day of the week
Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

By day
Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

By hour

Speed >50

Speed by week

Speed by day

Speed by hour
Number of session per hour
Edges per half hour
prueba <- table(cut(filter(df_hotedges_april16pt, day(time) < 7)$time, breaks = "30 mins"))
plot(prueba, xlab = "date", ylab = "frequency")
---
title: 'Sense My FEUP - April 2016 Data'
author: "Daniela S. Gil"
date: 25-08-2017
output: html_notebook
---

# Introduction 

This report explores the dataframes from April 2016 (sense_my_feup.Rmd).
The data is soon filtered to consider only points in the Porto square. 
Finally, it concludes the main edges for better evaluation in the next chapters. 

#### Index: 

  * Data setup 
  * Intersession time
  * Mapping 
  * Sessions 
  * Speed
  * Time series (first attempt)

--------------------------------------------------------------------------------

```{r echo = FALSE, eval=FALSE}
save.image("SensemyWorkSpace.RData")
#load("df_first_session_cars_april16.Rda")
load("SensemyWorkSpace.RData")
source("../functions/comparing_edges_level.R")
```


# SenseMyFEUP data

## Data loading
 Data used is previously filtered by travelmode (car and bus) and date (April 2016).      
 We filter our data to consider only points in a square of the Porto Metropolitan area. 
 
```{r echo=FALSE, eval=FALSE}
#Filtering by rectangle in Porto 

df_edges_april16 <- df_speed

df_edges_april16pt <- df_speed %>% 
  filter(lat < 41.1859352808155, lat > 41.1364726546, lon > -8.6912940681405, lon < -8.55396934228)
```

## Intersession times
  This section explores the time between different sessions in the 
same edge.  
### Intersession times and number of sessions.
```{r echo=FALSE, eval=FALSE}

# Transforming seconds to timestamp 

df_edges_april16pt$time <- as.POSIXct(df_edges_april16pt$seconds, origin="1970-01-01")
df_speed_top$time <- as.POSIXct(df_speed_top$seconds, origin="1970-01-01")
df_speed$time <- as.POSIXct(df_speed$seconds, origin="1970-01-01")

# Getting data from df_speed

df_intersession_april16pt <- df_edges_april16pt %>% 
  group_by(way_id, session_id) %>% 
  summarise(seconds = min(seconds))

# Calculating intersession time.

df_intersession_april16pt <- df_intersession_april16pt  %>%
  arrange(desc(way_id), time) %>% 
  mutate(intersession_time = c(0,as.numeric(diff(time), units="mins")))

df_intersession_april16pt$time <- as.POSIXct(df_intersession_april16pt$seconds, origin="1970-01-01")

```

All intersession time Porto April16. 
This figure shows that most of the edges have an intersession time greater than 1 hour.
```{r echo=FALSE}
plot_ly(y = df_intersession_april16pt$intersession_time, type = "box", name = "All way_ids") %>% 
  add_boxplot(y= filter(df_intersession_april16pt, intersession_time != 0 )$intersession_time, name = "At least 2 sessions") %>% 
  layout(title="Intersession time (mins)", yaxis = list(range = c(0,5000)))

```

Intersession time along the week.
Here we can see that weekends have an intersession time way bigger than weekdays, this 
could be caused by users going out of Porto on the weekends, reaching out our frontier. 
```{r echo=FALSE}
p <- ggplot(df_intersession_april16pt, aes(x = weekdays(time), y = intersession_time, fill = interaction(weekdays(time)) )) + 
  geom_boxplot() + 
  xlab("Days of the week" ) +
  ylab("Intersession time (mins)") +
  ggtitle("Intersession time(min) in the week ") + 
  theme(legend.position="none") +
    scale_x_discrete(limits = c("Domingo", "Segunda", "Terça", "Quarta", "Quinta", 
    "Sexta", "Sábado")) +
  coord_cartesian(ylim = c(0,10000))


p <- plotly_build(p)
p

```


```{r echo=FALSE, eval= FALSE}
# Histogram intersession time.
int2 <- df_intersession_april16pt %>%
  filter( intersession_time != 0) %>% 
  ggplot(aes(intersession_time/60)) + 
  geom_histogram(binwidth = 9) + 
  ggtitle("Intersession time (hours)") +
  xlab("Hours") +
  coord_cartesian(xlim=c(0,200))
ggplotly(int2)


```

Number of sessions per edge.
We can see the limitations in our data when most of the edges have less than 5 sessions
in the whole month, mostly even having just 1. 
In next chapters we will study more the edges with more than 50 sessions. 

```{r echo=FALSE,  message=FALSE}
int1 <- df_intersession_april16pt %>% 
  group_by(way_id) %>% 
  summarise(n_sessions = n()) %>% 
  ggplot(aes(x =n_sessions)) +
  geom_histogram(binwidth = 10) +
  coord_cartesian(xlim = c(0,150)) +
  ggtitle("Number of sessions per way_id in Porto") +
  xlab("Number of sessions") +
  geom_vline(xintercept = 50, size = 1, colour = "#FF3721",
                   linetype = "dashed")

ggplotly(int1)
```

### Analizing edges >50 sessions.  
* hotedges -> more 50 sessions. 3537 sessions in total.
* superhot_edges -> more 170 sessions. 1369 sessions in total. 

```{r echo=FALSE}
#Create df with way_ids with >50 sessions

df_id_hotedges_april16pt <- df_intersession_april16pt %>% 
  group_by(way_id) %>% 
  summarise(n_sessions = n()) %>%  
  filter(n_sessions >= 50) 

df_hotedges_april16pt <- df_id_hotedges_april16pt %>% 
  merge(y= df_edges_april16pt, by="way_id")

#Create df with way_ids with >170 sessions 
#This is NO LONGER used. 
#Updated with df_superhotedges_april16pt created in  preanalysis_course.Rmd
df_id_superhotedges_april16pt <- df_intersession_april16pt %>% 
  group_by(way_id) %>% 
  summarise(n_sessions = n()) %>%  
  filter(n_sessions >= 170) 

df_superhotedges_april16pt <- df_id_superhotedges_april16pt %>% 
  merge(y= df_edges_april16pt, by="way_id")

n_distinct(df_hotedges_april16pt$session_id)
```

##### Points (probably to delete) 
This section is needed to then map. Maps the amount of points.
It is better to map with number of sessions. 
```{r echo=FALSE, eval=FALSE} 
# POINTS for mapping

#All with sessions > 50
df_points_hotedges_april16pt <- df_intersession_april16pt %>% 
  group_by(way_id) %>% 
  summarise(n_sessions = n()) %>% 
  filter(n_sessions >= 50) %>% 
  merge( y = df_points_edge_osm_april16pt, by = "way_id") 

df_points_superhotedges_april16pt <- df_intersession_april16pt %>% 
  group_by(way_id) %>% 
  summarise(n_sessions = n()) %>% 
  filter(n_sessions >= 170) %>% 
  merge( y = df_points_edge_osm_april16pt, by = "way_id") 


df_superhotedges_april16pt <-mutate(df_superhotedges_april16pt, class = ifelse(hour(time) >=7 & hour(time) <= 20, 
                        "day",
                        "night"))
()
#Day
df_points_hotedges_april16pt_d <- df_hotedges_april16pt %>% 
  filter(hour(time) >=7, hour(time)<= 20, n_sessions >= 50) %>% 
  merge( y = df_points_edge_osm_april16pt, by = "way_id") 

# Night
df_points_april16pt_night_n <- df_hotedges_april16pt %>% 
  filter(hour(time) <7 | hour(time) > 20, n_sessions >= 20) %>% 
  merge( y = df_points_edge_osm_april16pt, by = "way_id") 

```

### ECDF 
Only 25% is under 1h intersession time. 
```{r echo=FALSE}
e1 <- ggplot(subset(df_intersession_april16pt, intersession_time > 0), aes(intersession_time)) + 
  stat_ecdf(geom = "step") +
  xlab("Intersession time(mins)")

e2 <- ggplot(subset(df_intersession_april16pt, intersession_time > 0), aes(intersession_time)) + 
  scale_x_log10() +stat_ecdf(geom = "step")  + xlab("Intersession time Log")

grid.arrange(e1, e2, ncol= 2, top = "Intersession time all sessions Porto (mins)")
```
 
As expected, there is higher intersession time at night. 
```{r echo=FALSE, eval=FALSE}
# Classifying points by day for ECDF 
df_intersession_april16pt <-mutate(df_intersession_april16pt, class = ifelse(hour(time) >=7 & hour(time) <= 20, 
                        "day",
                        "night"))

df_intersession_april16pt %>% 
  filter(intersession_time >0) %>% 
  subset(way_id %in% df_hotedges_april16pt$way_id) %>% 
  ggplot( aes(intersession_time, color= class )) + 
  scale_x_log10() + 
  stat_ecdf(geom = "step") + 
  ggtitle("Intersession time >50 sessions Porto")
```


ECDF by #sessions 

```{r echo=FALSE}
df_intersession_april16pt %>% 
  group_by(way_id) %>% 
  summarise(sessions = n() ) %>% 
  ggplot(aes(sessions)) + 
  scale_x_log10() + stat_ecdf(geom = "step") +
  ggtitle("ECDF number of sessions all Porto")
  
```

###  Edges with >50 sessions. 

```{r echo=FALSE, eval=TRUE}
low1 <- df_intersession_april16pt %>% 
  subset(way_id %in% df_hotedges_april16pt$way_id) %>% 
  subset( intersession_time < quantile(df_intersession_april16pt$intersession_time, 0.35)) %>% 
  ggplot(aes(intersession_time)) +
  geom_histogram(bins = 30) + 
  scale_x_continuous(breaks = seq(0,200,10))+
  xlab("Intersession time (mins)") + 
  ggtitle("Lowest 35% intersession time >50")

ggplotly(low1)
```


```{r echo=FALSE, eval=TRUE}
c1 <- df_intersession_april16pt %>% 
  subset(way_id %in% df_hotedges_april16pt$way_id) %>% 
  ggplot(aes(intersession_time, fill = class)) + 
  geom_histogram() + 
  xlab("Intersession time (min)") +
  xlim(c(0,3000)) +
  ylim(c(0,5000)) +
  ggtitle("Intersession time >50 day/night")

ggplotly(c1) 
```



### Showing maps 

Traffic  Map April Porto 2016 all
```{r eval= TRUE}
# Showing Map.
m1 <- m
m1
#mapshot(m, url = paste0(getwd(), "/map.html"))
```

Traffic  Map April 2016 all >50
```{r echo=FALSE}
m_50 <- m
m_50 
```


Top edges 

```{r echo=FALSE}
m_hot <- m
m_hot
```

```{r echo=FALSE}
m_100 <- m
m_100
```


### Number of sessions analysis.
#### Day of the week 
```{r echo= FALSE} 
plot.n_sessions(df_speed,df_hotedges_april16pt, df_speed_top, wday) + 
  xlab("Day of the week")  +
  scale_x_discrete(limits = c("Domingo", "Segunda", "Terça", "Quarta", "Quinta", 
    "Sexta", "Sábado")) 

```

#### By day 
```{r echo= FALSE}
plot.n_sessions(df_speed, df_hotedges_april16pt, df_speed_top, day) + 
  xlab("Day") +
  scale_x_continuous(breaks = seq(1,30,1)) 
```

#### By hour 
```{r echo= FALSE} 
plot.n_sessions(df_speed,df_hotedges_april16pt, df_speed_top, hour) + 
  xlab("Hour") +
  scale_x_continuous(breaks = seq(0,23,1)) 
```






## Speed >50

```{r}
# Speed by way_id and session.\
df_speed %>% 
  subset(way_id %in% df_hotedges_april16pt$way_id) %>% 
  group_by(way_id) %>% 
  summarise(avg_speed = mean((speed*18)/5), n = n() ) %>% 
  ggplot( aes(avg_speed)) +
  geom_histogram(binwidth = 3) 
```

Speed by week 
```{r echo=FALSE}
plot.speed(df_speed,df_hotedges_april16pt, df_speed_top, wday, median) + 
  xlab("Day of the week")  +
  scale_x_discrete(limits = c("Domingo", "Segunda", "Terça", "Quarta", "Quinta", 
    "Sexta", "Sábado")) 
```

Speed by day

```{r echo=FALSE}
plot.speed(df_speed, df_hotedges_april16pt, df_speed_top, day) + 
  xlab("Day") +
  scale_x_continuous(breaks = seq(1,30,1)) 

```

Speed by hour
```{r echo=FALSE}
plot.speed(df_speed,df_hotedges_april16pt, df_speed_top, hour) + 
  xlab("Hour") +
  scale_x_continuous(breaks = seq(0,23,1)) 
```

##
Number of session per hour
```{r echo=FALSE}
df_hotedges_april16pt %>% 
    filter(wday(time) != 1 , wday(time) != 7 ) %>% 
    group_by(hour =hour(time)) %>% 
    summarise(sessions = n_distinct(session_id)) %>% 
  ggplot(aes(hour, sessions)) + 
  geom_line()+
  scale_x_continuous(breaks = seq(0,23,1))
```

Edges per half hour
```{r}
prueba <- table(cut(filter(df_hotedges_april16pt, day(time) < 7)$time, breaks = "30 mins"))
plot(prueba, xlab = "date", ylab = "frequency") 
```

